youtube-watch-history-analysis
April 24, 2023 


1 YouTube Wrapped 


I recently learned that Google lets its users download the complete data of their YouTube account.
This data can reveal interesting patterns in our streaming habits, such as average daily watch
time, favorite video category, and favorite channel. To explore these and other questions, I
requested my data and performed the present analysis.


The first step is to request the YouTube watch history, which can be downloaded from Google
Takeout. For this analysis, the data needs to be in JSON format.


1.1 Mounting G-Drive


from google.colab import drive 
drive.mount('/content/drive') 


Drive already mounted at /content/drive; to attempt to forcibly remount, call
drive.mount("/content/drive", force_remount=True).


1.2 Import requirements 


from googleapiclient.discovery import build 
import pandas as pd 

import numpy as np 

import seaborn as sns 

import json 

import matplotlib.pyplot as plt 
import matplotlib.animation as anim 
import dateutil 

import random 

from wordcloud import WordCloud 
import requests 

import time 

import isodate 

import nltk 
nltk.download('stopwords') 

from nltk.corpus import stopwords 


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data] Package stopwords is already up-to-date!


1.3 Read in JSON file from Google Takeout, convert to list, and build YouTube
API client


To extract the video information from the YouTube API, I took some references from here.


[ ]: api_key = 'YOUR_API_KEY'  # placeholder: supply your own YouTube Data API key
     f = open(r'/content/drive/MyDrive/Colab Notebooks/Self practice projects/Youtube/watch history.json',
              encoding="UTF-8")
     history = json.load(f)
     history_list = []
     total_videos = len(history)
     for i in range(0, len(history)):
         if history[i]['header'] == 'YouTube':
             if 'titleUrl' in history[i]:
                 video = history[i]['titleUrl'].split('=', 1)[1]
                 view_date = history[i]['time']
                 if 'details' in history[i] and any(d.get('name') == 'From Google Ads'
                                                    for d in history[i]['details']):
                     # The history counts ads as videos, so we skip ads
                     # to obtain accurate results.
                     continue

                 history_list.append(dict(
                     watch_date=view_date,
                     video_id=video
                 ))
     youtube = build('youtube', 'v3', developerKey=api_key)

The last line of code uses the build() function from the googleapiclient.discovery module to
create a client object that can be used to interact with the YouTube Data API.


The build() call takes three arguments here:


• 'youtube' specifies the name of the API to use (in this case, the YouTube Data API).

• 'v3' specifies the version of the API to use.

• developerKey=api_key specifies the API key to use for authentication.


The resulting youtube object can then be used to make requests to the YouTube API, such as
retrieving additional information about the videos in the user's watch history.


[ ]: print(len(history_list))


22132 
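
The filtering done by the loop above can also be written as a small standalone function. This is only a sketch, not the code that produced the count above: the sample records are hypothetical, shaped like the entries I saw in my Takeout file, and the video ID is read from the `v` query parameter with `urllib.parse` instead of splitting on `=`.

```python
from urllib.parse import urlparse, parse_qs


def extract_watch_events(history):
    """Return watch events for YouTube entries that are not ads."""
    events = []
    for entry in history:
        # Skip non-YouTube entries and entries without a video link.
        if entry.get('header') != 'YouTube' or 'titleUrl' not in entry:
            continue
        # Skip ads, which Takeout marks in the optional 'details' list.
        if any(d.get('name') == 'From Google Ads' for d in entry.get('details', [])):
            continue
        # Read the video ID from the 'v' query parameter of the watch URL.
        video_ids = parse_qs(urlparse(entry['titleUrl']).query).get('v')
        if not video_ids:
            continue
        events.append({'watch_date': entry['time'], 'video_id': video_ids[0]})
    return events


# Hypothetical sample records (IDs and timestamps are made up).
sample_history = [
    {'header': 'YouTube', 'time': '2023-04-15T07:52:11.689Z',
     'titleUrl': 'https://www.youtube.com/watch?v=abc123DEF45'},
    {'header': 'YouTube', 'time': '2023-04-15T07:50:00.000Z',
     'titleUrl': 'https://www.youtube.com/watch?v=adAdAdAdAd0',
     'details': [{'name': 'From Google Ads'}]},
    {'header': 'YouTube Music', 'time': '2023-04-15T07:40:00.000Z',
     'titleUrl': 'https://music.youtube.com/watch?v=zzzzzzzzzzz'},
    {'header': 'YouTube', 'time': '2023-04-15T07:30:00.000Z'},
]

events = extract_watch_events(sample_history)
print(events)
```

Only the first record survives: the second is an ad, the third has a different header, and the fourth has no `titleUrl`.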


2 Function to extract video data using YouTube API


[ ]: def get_video_stats(youtube, sample_list):
         all_data = []
         all_ids = [sub['video_id'] for sub in sample_list]
         batched_ids = []
         n = 50  # batch size used for each videos().list request
         for i in range(0, len(all_ids), n):
             batched_ids.append(all_ids[i:i + n])

         for i in range(len(batched_ids)):
             request = youtube.videos().list(
                 part='snippet,contentDetails,statistics',
                 id=','.join(batched_ids[i]))  # the API takes a comma-separated list of IDs

             response = request.execute()

             for i in range(len(response["items"])):
                 data = dict(video_id=response["items"][i]["id"],
                             video_title=response["items"][i]["snippet"]['title'],
                             video_description=response["items"][i]["snippet"]['description'],
                             published_at=response["items"][i]["snippet"]['publishedAt'],
                             channel_id=response["items"][i]["snippet"]['channelId'],
                             category_id=response["items"][i]["snippet"]['categoryId'],
                             duration=response["items"][i]["contentDetails"]['duration'],
                             favorite_count=response["items"][i]["statistics"]['favoriteCount']
                             )
                 # Optional fields: store 'NULL' when the API omits them.
                 if 'tags' in response["items"][i]["snippet"]:
                     data['tag'] = response["items"][i]["snippet"]['tags']
                 else:
                     data['tag'] = 'NULL'

                 if 'likeCount' in response["items"][i]["statistics"]:
                     data['like_count'] = response["items"][i]["statistics"]['likeCount']
                 else:
                     data['like_count'] = 'NULL'

                 if 'commentCount' in response["items"][i]["statistics"]:
                     data['comment_count'] = response["items"][i]["statistics"]['commentCount']
                 else:
                     data['comment_count'] = 'NULL'

                 if 'viewCount' in response["items"][i]["statistics"]:
                     data['view_count'] = response["items"][i]["statistics"]['viewCount']
                 else:
                     data['view_count'] = 'NULL'

                 if data['video_title'] is None:
                     print(i)

                 all_data.append(data)
         return all_data


This code defines a function get_video_stats that takes two arguments, youtube and
sample_list. The function performs the following steps:


• Extracts the video IDs from sample_list and stores them in the all_ids list.

• Breaks the all_ids list into batches of 50 IDs (if there are more than 50) and stores
each batch in the batched_ids list.

• For each batch in batched_ids, sends a request to the YouTube API to get the
snippet, contentDetails, and statistics parts for each video in the batch.

• For each video in the batch, extracts the relevant information (video_id,
video_title, video_description, published_at, channel_id, category_id, duration,
favorite_count, tag, like_count, comment_count, and view_count) from the API response
and stores it in a dictionary called data.

• Appends the data dictionary to the all_data list.

• Once all batches have been processed, returns the all_data list containing the
video statistics for all videos in sample_list.
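
The batching step and the handling of optional fields can be tried without calling the API. This is a minimal sketch: `batch` and `safe_stat` are hypothetical helper names (they are not part of the notebook or of the API client), and the sample item is made up.

```python
def batch(ids, size=50):
    """Split a list of IDs into consecutive chunks of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def safe_stat(item, key, default=None):
    """Read an optional field from an item's 'statistics' part."""
    return item.get('statistics', {}).get(key, default)


# 120 made-up IDs are split into batches of 50, 50 and 20.
ids = [f'id{n}' for n in range(120)]
batches = batch(ids)
print([len(b) for b in batches])

# Hypothetical API item in which likes are hidden, so 'likeCount' is absent.
item = {'id': 'x', 'statistics': {'viewCount': '10'}}
print(safe_stat(item, 'viewCount'), safe_stat(item, 'likeCount'))
```

Returning `None` for a missing field, rather than the string 'NULL', would let pandas treat it as a missing value right away. That is a design alternative, not what the function above does.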


video_stats = get_video_stats(youtube, history_list)
# print(video_stats)


video_data = pd.DataFrame(video_stats)
video_data.head()


      video_id  \
0  DSgJisejWtw
1  CqYY_vSBfIc
2  SkKrzGvcaZs
3  5TecXk118sM
4  dXvp-xMIruM

   video_title  \
0  Karate Kid - The Jacket
1  Ep 2. Who Is Vedha? | Vikram Vedha
2  Virumaandi - Panchayat scene | Kamal Haasan | Napoleon | Pasupathy | Abhiramy | 4K [Eng Subs]
3  Soorarai Pottru - Deleted Scene 3- Arivu Threatens Maara | Sudha Kongara | Suriya | 2D Entertain...
4  Soorarai Pottru - Deleted Scene 7 - Friendship Song | Sudha Kongara | Suriya| 2D Entertainment

   video_description  \
0  Karaté Kid 2010 about the Jacket.
1  YNOT Studios' #vikramvedha [2017]\nWritten & Directed by Pushkar & Gayatri\nProduced by S. Sash...
2  Stream the full movie now on Amazon Prime Video:-\n https://bit.ly/VirumaandionPrimeVideo\n\nb...
3  Check out Suriya's "Soorarai Pottru" deleted scene of Arivu Threatening Maara (Suriya) on 2D En...
4  Maara has finally started an airline service. His team is recruiting candidates for an air hoste...

           published_at                channel_id  category_id  duration  \
0  2010-08-13T19:50:45Z  UCwJXE75BbFV-OW91iACnnGA            1   PT7M39S
1  2023-03-31T18:36:54Z  UCqVDSxEb7MNfYvddpbri4TA            1   PT6M46S
2  2021-01-28T12:30:11Z  UC_gXhnzeF5_XIFn4gx_bocg           22   PT5M20S
3  2021-02-20T08:37:20Z  UCj6rqKA33Ywu2GTFRDxHhnA           10   PT1M41S
4  2021-02-21T09:30:49Z  UCj6rqKAS3Ywu2GTFRDxHhnA           10   PT1M25S

   favorite_count  \
0               0
1               0
2               0
3               0
4               0

   tag  \
0  [veste, karate, karaté, kid, jacket, smith, jaden, jackie, chan, 2010, trailer, take, off, on]
1  [tamil movie scenes, vikram vedha, madhavan tamil movies, madhavan mass Scene, r madhavan, vijay...
2  [Ulaganayagan Tube, Kamal Haasan, Virumaandi, Virumaandi Trailer, virumandi Songs, kamal haasan ...
3  [2d music, 2d entertainment, surya emotional scenes, soorarai pottru emotional scene, soorarai p...
4  [2d music, 2d entertainment, surya emotional scenes, soorarai pottru emotional scene, soorarai p...

   like_count  comment_count  view_count
0      201932           4430    16229286
1        4599             32      578485
2      106158           1837     5948038
3       90541           1267     2820657
4       70662           1003     1550984


[ ]: video_view = pd.DataFrame(history_list)
     video_view.head(3)


[ ]:                  watch_date     video_id
     0  2023-04-15T07:52:11.689Z  DSgJisejWtw
     1  2023-04-15T07:49:17.922Z  CqYY_vSBfIc
     2  2023-04-15T07:45:02.091Z  SkKrzGvcaZs


3 Merge the two dataframes 


[ ]: final_data = video_view.merge(video_data, how='left', on='video_id')


4 How many viewed videos have been taken down? 
[ ]: final_data[final_data['video_title'].isna()].count()  # check for missing values


[ ]: watch_date           2331
     video_id             2331
     video_title             0
     video_description       0
     published_at            0
     channel_id              0
     category_id             0
     duration                0
     favorite_count          0
     tag                     0
     like_count              0
     comment_count           0
     view_count              0
     dtype: int64
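
The idea behind this count is that a left merge keeps every watch event, and events whose video ID got no metadata back from the API end up with an empty title. Below is a minimal sketch of the same logic with plain dictionaries, on made-up data. An ID that returns no metadata may belong to a removed or a private video, so "taken down" is my reading of the missing rows rather than something the API states.

```python
# Hypothetical watch events (left table) and API metadata keyed by video ID (right table).
watched = [
    {'watch_date': 't1', 'video_id': 'a'},
    {'watch_date': 't2', 'video_id': 'b'},
    {'watch_date': 't3', 'video_id': 'a'},
    {'watch_date': 't4', 'video_id': 'c'},
]
metadata = {
    'a': {'video_title': 'Video A'},
    'c': {'video_title': 'Video C'},
}

# Left join: keep every watch event, fill the title with None when metadata is missing.
merged = [{**w, **metadata.get(w['video_id'], {'video_title': None})} for w in watched]

# Watch events whose video returned no metadata.
missing = [row for row in merged if row['video_title'] is None]
print(len(merged), len(missing), [row['video_id'] for row in missing])
```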


5 Clean Data: Remove NAs and duplicates (not replays, identical
times)


[ ]: final_data_clean = final_data.dropna().copy()  # remove NAs
     final_data_clean = final_data_clean.drop_duplicates(['watch_date'])  # remove duplicates
     final_data_clean.head(3)


[ ]:                  watch_date     video_id  \
     0  2023-04-15T07:52:11.689Z  DSgJisejWtw
     3  2023-04-15T07:49:17.922Z  CqYY_vSBfIc
     4  2023-04-15T07:45:02.091Z  SkKrzGvcaZs

        video_title  \
     0  Karate Kid - The Jacket
     3  Ep 2. Who Is Vedha? | Vikram Vedha
     4  Virumaandi - Panchayat scene | Kamal Haasan | Napoleon | Pasupathy | Abhiramy | 4K [Eng Subs]

        video_description  \
     0  Karaté Kid 2010 about the Jacket.
     3  YNOT Studios' #vikramvedha [2017]\nWritten & Directed by Pushkar & Gayatri\nProduced by S. Sash...
     4  Stream the full movie now on Amazon Prime Video:-\n https://bit.ly/VirumaandionPrimeVideo\n\nb...

                published_at                channel_id  category_id  duration  \
     0  2010-08-13T19:50:45Z  UCwJXE75BbFV-OW91iACnnGA            1   PT7M39S
     3  2023-03-31T18:36:54Z  UCqVDSxEb7MNfYvddpbri4TA            1   PT6M46S
     4  2021-01-28T12:30:11Z  UC_gXhnzeF5_XIFn4gx_bocg           22   PT5M20S

        favorite_count  \
     0               0
     3               0
     4               0

        tag  \
     0  [veste, karate, karaté, kid, jacket, smith, jaden, jackie, chan, 2010, trailer, take, off, on]
     3  [tamil movie scenes, vikram vedha, madhavan tamil movies, madhavan mass Scene, r madhavan, vijay...
     4  [Ulaganayagan Tube, Kamal Haasan, Virumaandi, Virumaandi Trailer, virumandi songs, kamal haasan ...

        like_count  comment_count  view_count
     0      201932           4430    16229286
     3        4599             32      578485
     4      106158           1837     5948038
6 Convert data types


numeric_cols = ['view_count', 'like_count', 'favorite_count', 'comment_count']
final_data_clean[numeric_cols] = final_data_clean[numeric_cols].apply(
    pd.to_numeric, errors='coerce', axis=1)  # from object to number ('NULL' becomes NaN)


final_data_clean['watch_date'] = pd.to_datetime(
    final_data_clean['watch_date'], infer_datetime_format=True)  # from object to date
final_data_clean['published_at'] = pd.to_datetime(
    final_data_clean['published_at'], infer_datetime_format=True)  # from object to date


# New column: parse the ISO 8601 duration string into a timedelta
final_data_clean['duration_sec'] = final_data_clean['duration'].apply(
    lambda x: isodate.parse_duration(x))


# Convert the timedelta to a number of seconds (float)
final_data_clean['duration_sec'] = final_data_clean['duration_sec'].dt.total_seconds()
final_data_clean.dtypes


watch_date           datetime64[ns, UTC]
video_id                          object
video_title                       object
video_description                 object
published_at         datetime64[ns, UTC]
channel_id                        object
category_id                       object
duration                          object
favorite_count                   float64
tag                               object
like_count                       float64
comment_count                    float64
view_count                       float64
duration_sec                     float64
dtype: object


final_data_clean.isnull().any()


watch_date           False
video_id             False
video_title          False
video_description    False
published_at         False
channel_id           False
category_id          False
duration             False
favorite_count       False
tag                  False
like_count            True
comment_count         True
view_count            True
duration_sec         False
dtype: bool
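
The duration strings returned by the API (for example PT7M39S) are ISO 8601 durations, which is why isodate is used above. To make the conversion to seconds concrete, here is a minimal standard-library sketch. `duration_to_seconds` is a hypothetical helper that only handles the day, hour, minute and second designators, which is an assumption about what appears in video durations. It is not a replacement for isodate.

```python
import re

# Matches durations such as 'PT7M39S', 'PT1H2M3S' or 'P1DT1S'.
_DURATION = re.compile(r'^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$')


def duration_to_seconds(value):
    """Convert an ISO 8601 duration with D/H/M/S parts into whole seconds."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f'unsupported duration: {value!r}')
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return days * 86400 + hours * 3600 + minutes * 60 + seconds


print(duration_to_seconds('PT7M39S'))   # 7 * 60 + 39
print(duration_to_seconds('PT11M53S'))  # 11 * 60 + 53
print(duration_to_seconds('PT1H2M3S'))  # 3600 + 120 + 3
```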


7 What is my favorite Category? 


7.0.1 Replace default numerical representation of category to word. 


You can get the category names list from here. 


final_data_clean['category_id'] = final_data_clean['category_id'].replace(
    ['2', '1', '10', '15', '17', '18',
     '19', '20', '21', '22', '23', '24',
     '25', '26', '27', '28', '29', '30',
     '31', '32', '33', '34', '35', '36', '37', '38',
     '39', '40', '41', '42', '43', '44'],
    ['Autos & Vehicles', 'Film & Animation', 'Music', 'Pets & Animals', 'Sports', 'Short Movies',
     'Travel & Events', 'Gaming', 'Videoblogging', 'People & Blogs', 'Comedy', 'Entertainment',
     'News & Politics', 'How to & Style', 'Education', 'Science & Technology',
     'Nonprofits & Activism', 'Movies',
     'Anime/Animation', 'Action/Adventure', 'Classics', 'Comedy', 'Documentary', 'Drama',
     'Family', 'Foreign',
     'Horror', 'Sci-Fi/Fantasy', 'Thriller', 'Shorts', 'Shows', 'Trailers'])  # replace category id with category name


final_data_groupedby_category = final_data_clean.groupby(
    ['category_id'])['category_id'].size().reset_index(name='counts')  # group by category

final_data_groupedby_category = final_data_groupedby_category.sort_values(
    by=['counts'], ascending=False).reset_index(drop=True)  # sort by counts


Science_technology_educational_videos = final_data_clean[np.logical_or(
    final_data_clean['category_id'] == 'Science & Technology',
    final_data_clean['category_id'] == 'Education')]

pd.set_option('display.max_colwidth', 100)
fig = plt.figure(figsize=(13, 7))
font = {'family': 'sans-serif', 'color': 'black', 'weight': 'normal', 'size': 16}
color_palette = sns.color_palette('Reds_r', len(final_data_groupedby_category))


ax = sns.barplot(y='category_id', x='counts',
                 data=final_data_groupedby_category, orient='horizontal',
                 palette=color_palette)


plt.ylabel('', labelpad=20, fontdict=font)
plt.xlabel('', labelpad=20, fontdict=font)
plt.title('Number Of Videos Watched By Category', fontdict=font, pad=20)


# Add bar labels
for i in ax.containers:
    ax.bar_label(i, padding=6)


plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
plt.tick_params(axis='y', which='both', right=False, left=False, labelleft=True)


for pos in ['right', 'top', 'bottom', 'left']:  # remove the frame
    plt.gca().spines[pos].set_visible(False)


plt.show()


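[Figure: horizontal bar chart "Number Of Videos Watched By Category". Bars from top to bottom:
People & Blogs (5105), Entertainment, Sports (3360), Education (2170), Music (1499),
Science & Technology, Film & Animation, Comedy, How to & Style, News & Politics (324),
Gaming (279), Travel & Events (214), Autos & Vehicles (207), Nonprofits & Activism (42),
Pets & Animals (26). The bar labels for Entertainment, Science & Technology, Film & Animation,
Comedy, and How to & Style are not legible.]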
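
The id-to-name replacement in this section pairs two long lists by position, which is easy to get out of step. A dictionary lookup keeps each id next to its name. The sketch below uses only a small subset of the categories and made-up ids, and `CATEGORY_NAMES` is a hypothetical name that is not used elsewhere in the notebook.

```python
from collections import Counter

# Subset of the id-to-name mapping used above (ids are strings, as in the API response).
CATEGORY_NAMES = {
    '1': 'Film & Animation',
    '2': 'Autos & Vehicles',
    '10': 'Music',
    '17': 'Sports',
    '22': 'People & Blogs',
    '27': 'Education',
    '28': 'Science & Technology',
}

# Made-up category ids for seven watched videos; '99' is deliberately unknown.
category_ids = ['22', '22', '10', '17', '22', '27', '99']

# Unknown ids are kept as-is instead of being dropped.
names = [CATEGORY_NAMES.get(cid, cid) for cid in category_ids]

# Count videos per category, most frequent first.
counts = Counter(names).most_common()
print(counts)
```

With pandas, the same dictionary could be passed to `Series.map` or `Series.replace`.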
8 What's my favorite day to watch YouTube?


[ ]: day_data = final_data_clean.copy()
     day_data['watch_date'] = day_data['watch_date'].dt.day_name()


[ ]: days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

     final_data_groupedby_watch_day = day_data.groupby(
         ['watch_date'])['watch_date'].size().reset_index(name='count')

     # Order the weekdays Monday to Sunday instead of alphabetically
     final_data_groupedby_watch_day['watch_date'] = pd.Categorical(
         final_data_groupedby_watch_day['watch_date'], categories=days, ordered=True)

     final_data_groupedby_watch_day = final_data_groupedby_watch_day.sort_values('watch_date')


[ ]: final_data_groupedby_watch_day


[ ]:   watch_date  count
     1     Monday   2438
     5    Tuesday   2528
     6  Wednesday   2253
     4   Thursday   2019
     0     Friday   2612
     2   Saturday   4135
     3     Sunday   3816

[ ]: import matplotlib.cm as cm


     fig = plt.figure(figsize=(8, 8))
     font = {'color': 'black', 'weight': 'normal', 'size': 16}
     colors = cm.Blues(np.linspace(0.2, 1, len(final_data_groupedby_watch_day)))
     counts = final_data_groupedby_watch_day['count']
     labels = final_data_groupedby_watch_day['watch_date']


     # Get the position of the slice with the highest percentage
     max_index = counts.argmax()
     explode = [0] * len(counts)
     explode[max_index] = 0.1


     plt.pie(counts, labels=labels, colors=colors, autopct='%1.1f%%',
             startangle=-90, textprops={'fontsize': 14}, explode=explode, shadow=True,
             counterclock=False)
     plt.title('Number Of Videos Watched By Weekday', pad=20, fontdict=font)


     plt.axis('equal')
     plt.show()


[Figure: pie chart "Number Of Videos Watched By Weekday", with one labeled slice per weekday.]


9 In which week did I watch the most videos last year?


copy = final_data_clean.copy()
ttone = copy[final_data_clean['watch_date'].dt.year == 2022]
by_week = ttone.sort_values(by='watch_date', ascending=True).reset_index(drop=True)
by_week['week_num'] = by_week['watch_date'].dt.strftime('%U')  # get week number
by_week.head(1)


                         watch_date     video_id  \
0  2022-01-01 09:52:27.060000+00:00  1UqQhQ7_Mm8

   video_title  \
0  Vadivelu Bus Comedy | Aadhavan Comedy Scenes | Vadivelu Comedy | KalaignarTV

   video_description  \
0  Watch Vadivelu Bus comedy scene from the movie Aadhavan. For More latest Tamil Movies, Subscribe...

               published_at                channel_id       category_id  \
0 2020-09-14 12:30:03+00:00  UC8-dWVwoMQNOqIdWO4mIHPQ  Film & Animation

   duration  favorite_count  \
0  PT11M53S             0.0

   tag  \
0  [Full length Tamil movies, Latest Tamil Films, Tamil movies online, Cult hits, Super hit Tamizh ...

   like_count  comment_count  view_count  duration_sec week_num
0     18340.0          136.0   2914912.0         713.0       00


grouped_by_week = by_week.groupby(['week_num'])['week_num'].size().reset_index(
    name='videos_per_week')
fig = plt.figure(figsize=(13, 6))
watch_trend = sns.barplot(y='videos_per_week', x='week_num', data=grouped_by_week,
                          zorder=3, alpha=1, color='#FFC4C4')

# Highlight the week(s) with the highest number of videos
highest_count = grouped_by_week['videos_per_week'].max()
for i in range(len(watch_trend.patches)):
    if watch_trend.patches[i].get_height() == highest_count:
        watch_trend.patches[i].set_color('#FF0000')


plt.title('Videos Watched In 2022 Per Week', pad=20, fontdict=font)
plt.ylabel('Number of Videos', labelpad=20, fontdict=font)
plt.xlabel('', labelpad=20, fontdict=font)
plt.grid(axis='y', color='black', linewidth=.5, zorder=0)
plt.xticks(range(0, 53, 4))


plt.tick_params(axis='x', which='both', bottom=True,
                top=False, labelbottom=True)


for pos in ['top', 'bottom', 'right']:
    plt.gca().spines[pos].set_visible(False)


plt.show()
[Figure: bar chart "Videos Watched In 2022 Per Week"; y-axis: Number of Videos (ticks at 200, 300, 400).]
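
The week numbers in this section come from the `%U` format code, which counts weeks starting on Sunday and puts the days before the first Sunday of the year in week 00. That explains why the first row above, watched on Saturday 2022-01-01, has week_num 00. The small sketch below compares `%U` with the ISO week number. `week_of_year` is a hypothetical helper name.

```python
from datetime import date


def week_of_year(day):
    """Week number as produced by strftime('%U'): weeks start on Sunday, 00-based."""
    return day.strftime('%U')


for day in (date(2022, 1, 1), date(2022, 1, 2), date(2022, 12, 31)):
    # Print the date, its %U week label and its ISO 8601 week number side by side.
    print(day.isoformat(), week_of_year(day), day.isocalendar()[1])
```

Under the ISO convention, 2022-01-01 belongs to week 52 of the previous year, so the two numbering schemes group the first and last days of the year differently.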


10 Favorite YouTube channel


[ ]: # Keep videos shorter than 4 hours (14400 s) and total their length per channel
     final_data_groupedby_channel = final_data_clean[
         final_data_clean['duration_sec'] < 14400].copy()
     final_data_groupedby_channel['duration_min'] = (
         final_data_groupedby_channel['duration_sec'].div(60).round(2))
     final_data_groupedby_channel = final_data_groupedby_channel.groupby(
         ['channel_id'])['duration_min'].sum().reset_index(name='sum')
     final_data_groupedby_channel = final_data_groupedby_channel.sort_values(
         by=['sum'], ascending=False).reset_index(drop=True)
     final_data_groupedby_channel = final_data_groupedby_channel[
         final_data_groupedby_channel['sum'] > 1000]
     final_data_groupedby_channel


     def get_channel_title(cid):
         requesti = youtube.channels().list(
             part='snippet,contentDetails,statistics',
             id=cid)

         response = requesti.execute()
         title = response['items'][0]['snippet']['title']
         return title


     final_data_groupedby_channel['channel_title'] = (
         final_data_groupedby_channel['channel_id'].apply(get_channel_title))
     final_data_groupedby_channel.head(2)
[ ]:                  channel_id      sum  channel_title
     0  UCY6KjrDBN_tIRFT_QNqQbRQ  4245.76    Madan Gowri
     1  UCeVMnSShP_Iviwkknt83cww  3217.73  CodeWithHarry


[ ]: fig = plt.figure(figsize=(13, 7))
     color_palette = sns.color_palette('Reds_r', len(final_data_groupedby_channel))
     watch_trend = sns.barplot(y='channel_title', x='sum', data=final_data_groupedby_channel,
                               orient='horizontal', palette=color_palette)
     plt.title('Favorite Channels By View Time', pad=20, fontdict=font)
     plt.ylabel('Channel', labelpad=20, fontdict=font)
     plt.xlabel('Total View Time (min)', labelpad=20, fontdict=font)  # 'sum' holds minutes
     plt.xticks(rotation=80)
     for i in watch_trend.containers:
         watch_trend.bar_label(i, padding=4)


     plt.tick_params(axis='x', which='both', bottom=False,
                     top=False, labelbottom=False)
     plt.tick_params(axis='y', which='both', right=False,
                     left=False, labelleft=True)


     for pos in ['right', 'top', 'bottom', 'left']:
         plt.gca().spines[pos].set_visible(False)


     plt.show()


[Figure: horizontal bar chart "Favorite Channels By View Time"; y-axis: Channel, x-axis: total
view time (the plotted sums are in minutes). Bar labels:]

Madan Gowri                      4245.76
CodeWithHarry                    3217.73
Ranveer Allahbadia               3205.29
AP International                 1869.62
Tomorrowland                     1655.15
Tiésto                           1648.92
Ashwin                           1518.07
BeerBiceps                       1476.77
Mr. GK                           1446.27
MotoGP                           1418
Goldmines                        1382.76
Mr. Tamizhan                     1208.77
RJS Cinemas                      1166.11
England & Wales Cricket Board    1045.06
Gpmuthu Official                 1005.17
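
The per-channel totals can be sketched with the standard library on made-up rows. The sketch follows the same two choices as the cell above: videos of four hours or longer are ignored, and each video contributes its full length in minutes, so the totals are an upper bound on the time actually spent watching. The channel names, durations and the `MAX_SECONDS` name are hypothetical.

```python
from collections import defaultdict

MAX_SECONDS = 14400  # ignore videos of 4 hours or more, as in the cell above

# Made-up (channel, video length in seconds) rows, one per watch event.
rows = [
    ('chanA', 600),
    ('chanA', 1200),
    ('chanB', 300),
    ('chanA', 20000),  # dropped: longer than the cutoff
    ('chanB', 90),
]

totals = defaultdict(float)
for channel, seconds in rows:
    if seconds < MAX_SECONDS:
        totals[channel] += round(seconds / 60, 2)  # minutes per video

# Rank channels by total minutes, highest first.
ranking = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
print(ranking)

# Converting a total from minutes to hours, for example 4245.76 minutes:
print(round(4245.76 / 60, 2))
```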


11 What words are most common in the title of videos I have 
watched? 


stop_words = set(stopwords.words('english'))
stop_words.update(['short', 'statu', 'GTI', 'VW', 'Thing'])

# Split each title into words and drop the stop words
final_data_clean['title_no_stopwords'] = final_data_clean['video_title'].apply(
    lambda x: [item for item in str(x).split()
               if item not in stop_words]).copy()


all_words = list([a for b in final_data_clean['title_no_stopwords'].tolist()
                  for a in b])
all_words_str = ' '.join(all_words)


def plot_cloud(wordcloud):
    fig = plt.figure(figsize=(30, 20))
    plt.imshow(wordcloud)
    plt.axis("off");


wordcloud = WordCloud(width=1300, height=800,
                      background_color='white',
                      random_state=10,
                      max_words=300,
                      contour_width=3,
                      collocations=False).generate(all_words_str)
plot_cloud(wordcloud)
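
To see the numbers behind a word cloud, the word frequencies can be counted directly. The sketch below is a simplified stand-in for the cell above: `STOP_WORDS` is a tiny hypothetical set instead of the NLTK list, and, unlike the cell above, words are lower-cased and stripped of surrounding punctuation before the stop-word check. NLTK's English stop words are lower-case, so without that step a capitalized "The" would not be filtered.

```python
from collections import Counter

# Tiny hypothetical stop-word set standing in for the NLTK English list.
STOP_WORDS = {'the', 'a', 'of', 'in', 'is', 'who'}

# Two titles taken from the sample output earlier in this notebook.
titles = [
    'Karate Kid - The Jacket',
    'Ep 2. Who Is Vedha? | Vikram Vedha',
]


def title_words(title):
    """Split a title into lower-case words without punctuation or stop words."""
    words = []
    for token in title.split():
        word = token.strip('|?.,!-:()[]').lower()
        if word and word not in STOP_WORDS:
            words.append(word)
    return words


counts = Counter(word for title in titles for word in title_words(title))
print(counts.most_common(3))
```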